Bone conduction facilitates self-other voice discrimination

One's own voice is one of the most important and most frequently heard voices. Although it is the sound we associate most with ourselves, it is perceived as strange when played back in a recording. One of the main reasons is the lack of bone conduction that is inevitably present when hearing one's own voice while speaking. The resulting discrepancy between experimental and natural self-voice stimuli has significantly impeded self-voice research, rendering it one of the least investigated aspects of self-consciousness. Accordingly, factors that contribute to self-voice perception remain largely unknown. In a series of three studies, we rectified this ecological discrepancy by augmenting experimental self-voice stimuli with bone-conducted vibrotactile stimulation that is present during natural self-voice perception. Combining voice morphing with psychophysics, we demonstrate that specifically self-other but not familiar-other voice discrimination improved for stimuli presented using bone as compared with air conduction. Furthermore, our data outline independent contributions of familiarity and acoustic processing to separating the own from another's voice: although vocal differences increased general voice discrimination, self-voices were more confused with familiar than unfamiliar voices, regardless of their acoustic similarity. Collectively, our findings show that concomitant vibrotactile stimulation improves auditory self-identification, thereby portraying self-voice as a fundamentally multi-modal construct.


Introduction
We are all familiar with the strange sensation that occurs when we hear our voice in video or voice recordings [1-5]. Given the fundamental role our voice plays in everyday communication, this is surprising: we are exposed to our own voice daily throughout our lives, more than to even the most familiar voices of others, and it is the sound most intimately linked to our self. Although there is ample evidence that self-related stimuli are perceived differently from, and activate distinct cortical regions compared with, non-self-associated stimuli [6-14], the specific mechanisms of self-voice perception have been surprisingly under-investigated, in both behavioural and neuroimaging studies [15-17]. For instance, it remains poorly understood to what extent self-voice perception differs from the perception of other familiar voices, and to what extent the acoustic properties that enable discriminating the voices of other people [18] are involved in self-other voice discrimination (VD). A better understanding of self-voice perception is of immediate clinical relevance, as deficits in self-other VD have been related to auditory-verbal hallucinations (AVHs) [19-22] (i.e. 'hearing voices'), one of the most common [23,24] and most distressing [25,26] hallucinations in a major psychiatric disorder, schizophrenia. Investigating different perceptual factors underlying self-other VD, we hypothesized that one key contribution would stem from bone conduction and, based on our findings, we propose a new experimental paradigm that improves the ecological validity of self-voice research.
A crucial contribution to the perception of our own voice, and of our own voice only, comes from bone conduction during speech production. Under natural conditions, one's spoken voice is transmitted not only through the air but also, unfailingly, through the skull [27,28], which alters self-voice perception in two ways. First, because sound propagates differently through bone, bone conduction transforms the sound of our voice; specifically, it is assumed to act as a low-pass filter [29,30]. Because of this low-frequency emphasis, our voice sounds lower to us [29] than it does to others. Second, in addition to transforming the sound of our voice, bone conduction conveys additional sensory information: the vibrations of the skull and the deformation of the skin give rise not only to auditory but also to vibrotactile [31] and somatosensory [32,33] signals. Thus, self-voice heard under natural conditions is not a purely auditory but a multi-modal percept.
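To make the notion of a low-pass filter concrete, the sketch below applies a generic first-order (one-pole) low-pass filter to two pure tones and compares their steady-state gains. It is an illustration only: the filter type, the 500 Hz cutoff and the 16 kHz sample rate are our assumptions, and the code is not a model of the skull's actual transfer function, which the literature cited above characterizes only as low-pass.

```python
import math


def one_pole_lowpass(samples: list[float], sample_rate: float, cutoff_hz: float) -> list[float]:
    """Generic first-order IIR low-pass filter (illustrative, not a bone-conduction model).

    y[n] = y[n-1] + a * (x[n] - y[n-1]), with the smoothing factor a derived from the cutoff.
    """
    a = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / sample_rate)
    out, y = [], 0.0
    for x in samples:
        y += a * (x - y)
        out.append(y)
    return out


def tone(freq_hz: float, sample_rate: int, dur_s: float) -> list[float]:
    """Unit-amplitude sine tone."""
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate) for n in range(int(sample_rate * dur_s))]


def rms(samples: list[float]) -> float:
    return math.sqrt(sum(s * s for s in samples) / len(samples))


if __name__ == "__main__":
    sr, cutoff = 16_000, 500.0  # hypothetical values chosen for the demonstration
    for f in (200.0, 4000.0):
        filtered = one_pole_lowpass(tone(f, sr, 1.0), sr, cutoff)
        steady = filtered[len(filtered) // 2:]  # skip the filter's start-up transient
        # A unit sine has RMS 1/sqrt(2), so this ratio is the filter's amplitude gain.
        print(f"{f:6.0f} Hz: gain = {rms(steady) * math.sqrt(2):.2f}")
```

With these assumed settings the 200 Hz tone passes almost unattenuated, whereas the 4 kHz tone is strongly attenuated, which is the low-frequency emphasis described above in its simplest form.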
One reason for the scarcity of self-voice studies probably lies in the methodological obstacles to creating appropriate experimental stimuli. Lacking bone conduction, the stimuli of prior self-voice studies inevitably entail a perceptual mismatch between the experimental self-voice (e.g. presented through air-conducting loudspeakers) and the actual self-voice. Indeed, the majority of studies that compared recognition of self-voice with that of other voices reported lower accuracy and longer response times for self-voice [16,34-48]. Early self-voice studies suggested that this discrepancy between self- and other voices might result from lower previous exposure to self-voice in voice recordings [34,35,37]. However, similar behavioural differences persist [16,36-41,45] despite the greater exposure to recorded self-voice afforded by contemporary technology (e.g. voice messages and video recordings). Moreover, more recent self-voice paradigms often show ceiling effects [37,39-41,46-49], e.g. high accuracy in all experimental conditions, reflecting the need for more sensitive experimental paradigms. To account for this ecological discrepancy, several studies investigated whether acoustic transformations (e.g. low-pass or other types of filters) of air-conducted self-voice stimuli would make the self-voice sound more natural to listeners. These attempts, however, yielded contradictory results [50-54], indicating preferences for different acoustic transformations. Crucially, these studies manipulated only one aspect of bone conduction effects on self-voice (i.e. acoustic transformations) and neglected the additional vibrotactile stimulation.
To better approximate the natural self-voice, experimental self-voice stimuli should be accompanied by the vibrotactile stimulation that results from vibrations of the skull. Here, we address this by providing vibrational input through a bone conduction headset and investigate whether it improves self-voice perception compared with auditory input alone.
In a series of three behavioural studies in independent cohorts, using a new self-voice perception paradigm, we investigated three main perceptual factors of self-other VD: (i) sound conduction type (air versus bone), (ii) other-voice familiarity (familiar versus unfamiliar), and (iii) acoustic voice parameters. Using voice-morphing technology [55] and bone conduction headphones, we designed a psychophysical self-other VD task to investigate the nature of perceptual differences in self-other VD while trying to avoid ceiling effects. Participants heard short voice morphs of their own and other people's vocalizations (phoneme /a/) and indicated whether each morph more closely resembled their own or someone else's voice. In Study 1 (N = 16), we investigated differences in self-other VD as a function of sound conduction (air, bone) and how these are modulated by previous exposure to self-voice [34,35,37]; in Study 2 (N = 16), we extended this to familiar-other VD to investigate whether the bone conduction effects are specific to self-voice or generalize to other familiar voices [56,57]. In Study 3, we set out to replicate Studies 1 and 2 within a single, larger cohort (N = 52). We furthermore included an additional self-familiar VD task and a control self-voice recognition task (without voice morphing), and we investigated the acoustic parameters of all tested voices [18]. We hypothesized that bone conduction would facilitate self-voice perception in self-other VD (bias or increased sensitivity) (Study 1) but would not affect the familiar-other VD task (Study 2). We further hypothesized that bone conduction effects would be more prominent when participants had not been exposed to the self-voice used in our experiment before the task, i.e. when task difficulty is increased (Studies 1 and 2), and that they would occur regardless of other-voice familiarity [56,57] (Study 3).
royalsocietypublishing.org/journal/rsos R. Soc. Open Sci. 10: 221561

Participants
Studies 1 and 2 each involved 16 participants. In Study 1, seven participants were male (mean age ± s.d.: 29.7 ± 5.5 years); in Study 2, eight were male (28.5 ± 5.5 years). Study 3 involved 52 participants (20 male, 26.5 ± 4.6 years), who were recruited in pairs: each participant was accompanied by an acquaintance (a friend) of the same gender and similar age who also took part in the study, which allowed participants to provide both self- and familiar voices (see Procedure). Nine of the 52 participants were excluded from the main regression analysis in Study 3 because of their low performance in the control task (see Procedure). All participants were right-handed and reported no hearing deficits and no history of psychiatric or neurological disorders. They were recruited from the general population and were naive to the purpose of the study. Participants gave informed consent in accordance with institutional guidelines (protocol 2015-00092, approved by the Comité Cantonal d'Ethique de la Recherche of Geneva) and the Declaration of Helsinki, and received monetary compensation (CHF 20 per hour).
The sample size for Study 3 was selected based on a power analysis of Study 1, which indicated that a sample size of n = 47 provides greater than 84% power (95% CI = [75.32, 90.57]) for the interaction between Conduction and Voice Morph, assuming an effect size of −0.12 (100 simulations, α = 0.05).
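Simulation-based power is the proportion of simulated data sets in which the effect reaches significance, so its uncertainty is that of a binomial proportion. The interval reported above is consistent with an exact (Clopper-Pearson) binomial interval for 84 significant results out of 100 simulations; treating it as such is our assumption, not something the analysis description states. The sketch below computes that interval with the standard library only, by bisecting on the binomial tail probabilities instead of calling a beta-quantile function.

```python
import math


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def clopper_pearson(successes: int, n: int, alpha: float = 0.05, tol: float = 1e-10):
    """Exact (Clopper-Pearson) confidence interval for a binomial proportion."""

    def bisect(f, lo: float = 0.0, hi: float = 1.0) -> float:
        # f is increasing in p, negative at lo and positive at hi.
        while hi - lo > tol:
            mid = (lo + hi) / 2
            if f(mid) < 0:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    # Lower bound solves P(X >= successes | p) = alpha / 2 (upper tail grows with p).
    lower = 0.0 if successes == 0 else bisect(
        lambda p: (1 - binom_cdf(successes - 1, n, p)) - alpha / 2)
    # Upper bound solves P(X <= successes | p) = alpha / 2 (lower tail shrinks with p).
    upper = 1.0 if successes == n else bisect(
        lambda p: alpha / 2 - binom_cdf(successes, n, p))
    return lower, upper


significant, n_sims = 84, 100  # 84% power from 100 simulations (our reading of the text)
lo, hi = clopper_pearson(significant, n_sims)
print(f"power = {significant / n_sims:.0%}, 95% CI = [{lo:.2%}, {hi:.2%}]")
```

The same interval is what R's `binom.test` reports; 100 simulations leave the power estimate uncertain by roughly ±8 percentage points, which is why the interval is wide.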

Study 1: self-other voice discrimination
In Study 1, we morphed each participant's voice with the voice of a gender-matched unfamiliar person. For each voice morph, participants indicated whether the voice they heard more closely resembled their own or someone else's voice by pressing one of two buttons. Based on our previous work [58,59], six voice ratios (% self-voice: 15, 30, 45, 55, 70, 85; figure 1a) were chosen, and each was repeated 10 times within a block in randomized order (60 trials in total). The study comprised four experimental blocks, which differed in sound conduction type (air, bone) and in whether participants had been exposed to their unmorphed self-voice immediately before the block. In the first two blocks, participants performed the task, once with each type of sound conduction, without having heard the unmorphed recording of their voice, whereas before the remaining two blocks the unmorphed self-voice was presented to them (figure 1b). The order of air- and bone-conduction blocks was counterbalanced across participants for both parts of the experiment (with and without previous exposure to self-voice).
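The block structure described above can be sketched as follows. This is an illustrative Python reconstruction, not the authors' MATLAB/Psychtoolbox code: the trial dictionaries, the seed, and in particular the counterbalancing rule in `block_order` (cycling through the four possible air/bone orders across consecutive participants) are hypothetical.

```python
import random

MORPH_LEVELS = (15, 30, 45, 55, 70, 85)  # % self-voice in each morph
REPETITIONS = 10  # repetitions of each morph level per block


def make_block(conduction: str, rng: random.Random) -> list[dict]:
    """One block: each morph level repeated REPETITIONS times, shuffled (60 trials)."""
    trials = [
        {"conduction": conduction, "morph": level}
        for level in MORPH_LEVELS
        for _ in range(REPETITIONS)
    ]
    rng.shuffle(trials)
    return trials


def block_order(participant_index: int) -> list[tuple[str, bool]]:
    """(conduction, previous_exposure) for the four blocks of one participant.

    Hypothetical counterbalancing scheme: the air/bone order is chosen independently
    for the part without and the part with previous exposure, cycling through all
    four combinations across consecutive participants.
    """
    orders = (("air", "bone"), ("bone", "air"))
    part1 = orders[participant_index % 2]
    part2 = orders[(participant_index // 2) % 2]
    return [(c, False) for c in part1] + [(c, True) for c in part2]


rng = random.Random(2024)  # fixed seed so that a session can be regenerated
session = [(cond, exposed, make_block(cond, rng)) for cond, exposed in block_order(0)]
```

Keeping the exposure flag in the block order, rather than shuffling all four blocks, preserves the fixed "without exposure first" structure of the design while still counterbalancing conduction order.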

Study 2: familiar-other voice discrimination
In Study 2, the experimental design (figure 1c) was equivalent to that of Study 1, except that the self-voice was replaced with the voice of a familiar other. If the effects in Study 1 were caused by familiarity with one's own voice rather than by one's own voice per se (the other voice in Study 1 being unfamiliar), we would expect similar effects in Study 2. Thus, participants heard voice morphs between a familiar other voice and an unfamiliar other voice, via either air or bone conduction, and either without (first two blocks) or with previous exposure to the familiar voice. In each trial, they indicated whether the morph more closely resembled the familiar voice or someone else's.
Study 3: self-other, familiar-other, and self-familiar voice discrimination
In Study 3, we contrasted self-other and familiar-other VD tasks in the same, independent cohort of participants. In the same group, we also performed a self-familiar VD task to investigate whether the bone conduction effects persist irrespective of other-voice familiarity. Study 3 thus consisted of two parts (figure 1d). In the first part, participants performed two blocks of the self-other VD task (air and bone; cf. Study 1) and two blocks of the familiar-other task (air and bone; cf. Study 2) in counterbalanced order. This was followed by two blocks of the self-familiar VD task (air and bone), also counterbalanced across participants. Self-familiar blocks were always conducted after the self-other and familiar-other blocks, so that exposure to the self- and familiar voices was balanced while each was discriminated from the unfamiliar voice, before the two were tested against each other.

Study 3: control self-voice recognition task
At the end of Study 3 (figure 1d), participants performed a control self-voice recognition task in which, unbeknown to them, the stimuli consisted only of unmorphed voices (self, familiar, and unfamiliar) and, unlike in the VD tasks, all three voices were presented within the same experimental block. In each trial, participants indicated by button press whether the voice they heard resembled their own. There were two control task blocks, one for each form of sound conduction (air or bone), counterbalanced across participants. Each of the three unmorphed voices was repeated 10 times within a block in random order. As this task served as a control to identify whether participants were able to recognize their unmorphed recorded voice, it was always performed at the end of the experiment, so that previous exposure to unmorphed voice recordings could not affect performance in the discrimination tasks.
Figure 1. (a) Voice morphs were sampled from the self-other voice continuum generated with voice-morphing technology. Equivalent voice morphs were used in the other discrimination tasks (familiar-other and self-familiar). (b) Study 1 design. Two blocks (bone and air conduction) of the self-other task were first performed without (self-voice icon crossed out) and then with the self-voice presented prior to the task (previous exposure to self-voice). (c) Study 2 design. Two blocks (bone and air conduction) of the familiar-other task were first performed without (familiar-voice icon crossed out) and then with the familiar voice presented prior to the task (previous exposure to familiar voice). (d) Study 3 design. Self-voice and familiar voice were first discriminated against the unfamiliar voice and then against each other. The control task, in which self-voice was detected among the three unmorphed voices, was conducted at the end of Study 3.

Stimuli and materials
Prior to the studies, participants' voices were recorded while they vocalized the phoneme /a/ for approximately 1-2 s (Zoom H6 Handy recorder). Each recording was normalized for average intensity (−12 dBFS) and duration (500 ms) and cleaned of background noise (Audacity software). In detail, immediately before recording participants' voices, we recorded a few seconds of baseline background noise, which was then filtered out of the vocalization. The 500 ms clips used as stimuli were selected from the utterance by avoiding its onset and offset and by choosing its most stable part (i.e. 500 ms that do not noticeably vary in sound intensity). Noise reduction parameters in Audacity were set to their defaults (12 dB, sensitivity: 6, smoothing: 3 bands). Cleaning of background noise did not significantly alter the formants of the voice stimuli. The distance between the recorder and participants was not controlled for, but it was always around 20 cm, and sound intensity was normalized for each recording, standardizing the recordings across participants. In principle, participants vocalized /a/ only once, unless the recording was of insufficient quality (e.g. too short for both onset and offset to be avoided when selecting the 500 ms interval, or varying noticeably in sound intensity across the 1-2 s of the recording). In Studies 1-3, these preprocessed recordings were used to generate voice morphs spanning a voice identity continuum between two voices using TANDEM-STRAIGHT [55] (e.g. a morph can be generated that contains 40% of person A's and 60% of person B's voice).
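The clip selection and level normalization described above can be approximated programmatically. The sketch below is a hypothetical Python stand-in for what was done by hand in Audacity: it assumes that "average intensity" means RMS level, treats "most stable" as the window whose 20 ms frame levels have the smallest standard deviation, and approximates onset/offset avoidance by skipping a 100 ms margin at each end. The frame length and margin are our assumptions, and Audacity's noise reduction is not reproduced.

```python
import math
import statistics


def rms_dbfs(samples: list[float]) -> float:
    """RMS level in dB relative to full scale (samples assumed to lie in [-1, 1])."""
    rms = math.sqrt(sum(x * x for x in samples) / len(samples))
    return 20.0 * math.log10(max(rms, 1e-12))  # floor avoids log10(0) on silent frames


def normalize_rms(samples: list[float], target_dbfs: float = -12.0) -> list[float]:
    """Scale samples so that their RMS level equals target_dbfs (assumed RMS-based)."""
    gain = 10.0 ** ((target_dbfs - rms_dbfs(samples)) / 20.0)
    return [x * gain for x in samples]


def stable_window_start(samples, sample_rate, win_s=0.5, frame_s=0.02, margin_s=0.1):
    """Start index of the win_s-long segment whose frame-wise levels vary least.

    Onset/offset avoidance is approximated by skipping margin_s at both ends;
    frame_s and margin_s are hypothetical choices, not values from the study.
    """
    frame = round(frame_s * sample_rate)
    win_frames = round(win_s / frame_s)
    skip = round(margin_s / frame_s)
    levels = [rms_dbfs(samples[i:i + frame])
              for i in range(0, len(samples) - frame + 1, frame)]
    candidates = range(skip, len(levels) - skip - win_frames + 1)
    if not candidates:
        raise ValueError("recording too short for the requested window and margins")
    # Stability criterion: SD (in dB) of the frame levels inside the window.
    best = min(candidates, key=lambda j: statistics.pstdev(levels[j:j + win_frames]))
    return best * frame
```

For a recording sampled at `sr` Hz, slicing 500 ms from the returned start index and passing it to `normalize_rms` yields a clip at −12 dBFS RMS. A recording that is too short for the window plus margins raises an error, which mirrors the criterion for asking participants to vocalize again.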
In Study 1, the other voice was that of a gender-matched unfamiliar person. In Study 2, the familiar voice belonged to a male person with whom participants were acquainted. In Study 3, participants came with a gender-matched acquaintance who also participated in the study and whose voice served as the familiar-other voice. The gender-matched unfamiliar voices were the same in all studies.
In Studies 1 and 2, the unmorphed voices in blocks with previous exposure were presented to participants through the same sound conduction type used for that experimental block (air or bone).
In Study 3, we used headphones (Bose QC20) as the air-conduction medium instead of laptop loudspeakers (GIGABYTE AORUS x5, Studies 1 and 2). Both the air- and bone-conducting headphones were placed on participants' heads before the experiment began and were matched for loudness at low sound intensities, at which vibrotactile sensations resulting from bone conduction could not be perceived, so that participants were unable to determine the source of the auditory stimuli throughout the experiment. This stricter procedure provides better correspondence in sound intensity and spatial location between the bone- and air-conducted stimuli. However, we did not formally quantify the extent to which participants were able to determine the source of the auditory stimuli; this was only inferred from participants' reports. Despite this difference in air-conduction medium (loudspeakers in Study 1 versus headphones in Study 3), we observed similar effects relative to bone conduction in both studies (see Results). In all studies, we used the same Aftershokz Sportz Titanium headphones as the bone-conducting medium.
In all studies, inter-trial intervals were jittered between 1 and 1.5 s to avoid predictability of stimulus onset. All experiments were run in MATLAB R2017b using the Psychtoolbox library [60].

Statistical analysis
Voice discrimination tasks
In all studies, the data were analysed with binomial mixed-effects regressions with Response as the dependent variable, indicating whether participants perceived the presented voice morph as resembling their own voice (self-other and self-familiar VD) or the familiar voice (familiar-other VD). In Studies 1 and 2, the regressions contained two fixed effects with an interaction term, Conduction (air, bone) and Previous Exposure (yes, no), as well as a fixed effect of Voice Morph (15, 30, 45, 55, 70, 85%). In Study 3, the effect of sound conduction on each type of VD (self-other, familiar-other, and self-familiar) was analysed with binomial mixed-effects regressions with Response as the dependent variable and Conduction (air, bone) and Voice Morph (15, 30, 45, 55, 70, 85%), together with their two-way interaction, as fixed effects. For all mixed-effects regressions in all studies, random effects included a by-participant random intercept, and by-participant random slopes for the main effects were added following model selection based on maximum likelihood. Trials with reaction times more than two interquartile ranges above or below each participant's median were considered outliers and excluded. Additionally, a linear mixed-effects regression with Reaction Times as the dependent variable and the same fixed and random effects was performed for all studies, with a second-order polynomial expansion of the Voice Morph variable (electronic supplementary material). To further validate null findings of the binomial mixed-effects regressions, we fitted equivalent models in a Bayesian framework. Statistical tests were performed with R [61], using the lme4 [62], lmerTest [63] and cocor [64] packages. The results were illustrated using the sjPlot [65] and ggplot2 [66] packages. Power analysis was performed with the simr [67] package.
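For the Study 3 tasks, the model described above can be written as follows, for trial i of participant j. We assume dummy coding of Conduction (0 = air, 1 = bone) and show the maximal by-participant random-effects structure; the structure actually retained depended on the model selection described above. In lme4 formula syntax this corresponds roughly to `Response ~ Conduction * VoiceMorph + (1 + Conduction + VoiceMorph | Participant)`, with hypothetical variable names.

```latex
\begin{aligned}
\operatorname{logit} \Pr(\mathrm{Response}_{ij} = 1)
  &= (\beta_0 + u_{0j})
   + (\beta_1 + u_{1j})\,\mathrm{Conduction}_{ij}
   + (\beta_2 + u_{2j})\,\mathrm{Morph}_{ij}
   + \beta_3\,\mathrm{Conduction}_{ij}\,\mathrm{Morph}_{ij}, \\
(u_{0j}, u_{1j}, u_{2j})^{\top} &\sim \mathcal{N}(\mathbf{0}, \boldsymbol{\Sigma}).
\end{aligned}
```

Under this coding the air-conduction psychometric curve has intercept β0 and slope β2, and the bone-conduction curve has intercept β0 + β1 and slope β2 + β3, so a steeper curve under bone conduction corresponds to β3 > 0 and a lower intercept to β1 < 0. In Studies 1 and 2, the Conduction × Voice Morph term is replaced by Conduction × Previous Exposure.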
Bayesian models were fitted in the Stan computational framework (http://mc-stan.org/), accessed via the brms package [68].
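The per-participant reaction-time exclusion rule described above can be sketched as follows. The trial representation (dictionaries with `participant` and `rt` keys) is hypothetical, and we assume that quartiles were computed with R's default definition (type 7), which Python's `statistics.quantiles` reproduces with `method="inclusive"`.

```python
import statistics
from collections import defaultdict


def keep_mask(rts: list[float], k: float = 2.0) -> list[bool]:
    """True for trials whose RT lies within k interquartile ranges of the median.

    method="inclusive" reproduces R's default (type 7) quartiles (an assumption
    about how the original IQR was computed).
    """
    q1, _, q3 = statistics.quantiles(rts, n=4, method="inclusive")
    median, iqr = statistics.median(rts), q3 - q1
    return [abs(rt - median) <= k * iqr for rt in rts]


def drop_rt_outliers(trials: list[dict], k: float = 2.0) -> list[dict]:
    """Apply keep_mask separately to each participant's trials."""
    by_participant = defaultdict(list)
    for trial in trials:
        by_participant[trial["participant"]].append(trial)
    kept = []
    for p_trials in by_participant.values():
        mask = keep_mask([t["rt"] for t in p_trials], k)
        kept.extend(t for t, keep in zip(p_trials, mask) if keep)
    return kept
```

Because the median and IQR are robust to the very outliers being removed, a single extremely slow trial does not shift the exclusion bounds, unlike a mean ± s.d. criterion.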

Control task and other-to-self-voice confusion
The performance in the control task of Study 3 was also analysed with mixed-effects binomial regressions with Response as dependent variable and two fixed effects with an interaction term: Conduction (air, bone) and Voice (self, familiar, unfamiliar).
For the control task of Study 3, we additionally explored whether self-voice was more often confused with the familiar or with the unfamiliar voice. For that purpose, we correlated the rate of 'other' responses in self-voice trials (i.e. miss rate) with the rates of 'self' responses in familiar- and unfamiliar-voice trials (i.e. false-alarm (FA) rates). Pearson and Filon's z-test for comparing two correlations based on dependent groups with overlapping variables [69] was used to compare these two correlations (miss rate with each of the two FA rates: familiar-as-self and unfamiliar-as-self misperception). The two FA rates were also correlated with each other. Where significant, separate correlations were then computed for, and compared between, the two forms of sound conduction (air, bone; electronic supplementary material).
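Pearson and Filon's test compares r_jk with r_jh, two correlations that share variable j (here, the miss rate) and are measured in the same participants, taking into account the correlation r_kh between the two non-shared variables (here, the two FA rates). The sketch below implements the statistic in its standard form as we understand it; it is an illustration, not the cocor source code.

```python
import math


def pearson_filon_z(r_jk: float, r_jh: float, r_kh: float, n: int) -> tuple[float, float]:
    """Pearson & Filon (1898) z for two dependent, overlapping correlations.

    Compares r_jk with r_jh (sharing variable j); r_kh is the correlation between
    the two non-shared variables. Returns (z, two-sided p-value).
    """
    # Asymptotic covariance between r_jk and r_jh (multiplied by n).
    k = (r_kh * (1 - r_jk**2 - r_jh**2)
         - 0.5 * r_jk * r_jh * (1 - r_jk**2 - r_jh**2 - r_kh**2))
    # Asymptotic variance of the difference (times n) is the denominator under the root.
    z = (r_jk - r_jh) * math.sqrt(n) / math.sqrt(
        (1 - r_jk**2) ** 2 + (1 - r_jh**2) ** 2 - 2 * k)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p under a standard normal
    return z, p


# Rounded values from the control-task results: miss/familiar-FA r = 0.67,
# miss/unfamiliar-FA r = 0.2, familiar-FA/unfamiliar-FA r = -0.07, n = 52.
z, p = pearson_filon_z(0.67, 0.2, -0.07, 52)
```

With these rounded inputs the sketch gives z ≈ 2.90 (p ≈ 0.004), close to but not identical with the reported z = 2.86, presumably because the published correlations are rounded to two decimals.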

Self-other voice discrimination acoustic analysis
We subsequently investigated whether the physical acoustic parameters that have been shown to account for the discrimination of other voices [18] also affect VD for one's own voice. Participants' unmorphed voices were placed in voice spaces as defined by Baumann & Belin [18], whose axes represent different acoustic parameters of the voices. In this space, similar-sounding voices are located close to each other, and inter-voice distances have been correlated with other-voice discriminability [18] and related to activity in the auditory cortex [70]. A two-dimensional voice space was created [18,71,72], with dimensions corresponding to the contributions of the source (pitch, larynx) and the filter (formants, vocal tract) in voice production [73] (figure 4a). The voice spaces were normalized such that their origin corresponds to the other voice in each self-other voice pair. The distance to the origin thus represents the acoustic difference between self- and other voices.
In detail, for each voice recording, we extracted the fundamental frequency (F0) and five formants (F1-F5) using Praat software [74] and computed its voice-space coordinates, corresponding to the source (x coordinate) and filter (y coordinate) components of voice production [73] (males: x = log(F0), y = log(F5 − F4); females: x = log(F0), y = log(F1)). The choice of coordinates was based on the work by Baumann & Belin [18], who demonstrated that this combination of acoustic parameters best accounts for the subjective discriminability of voices for each gender. As an exploratory analysis, we constructed several different voice spaces using different acoustic parameters for the y-axis; the results remained unaltered and are reported in the electronic supplementary material. The coordinates were first transformed into z-scores, after which the voice spaces were normalized for the other voice by subtracting the other-voice coordinates from the self-voice coordinates in each self-other voice pair. This resulted in a coordinate system in which the Euclidean distance to the origin represents the self-other voice distance in z-score units. Z-scoring the coordinates enabled us to place all participants (male and female) in the same voice space. Distances to the origin (self-other voice distances) were then correlated with the percentage of correct responses in the self-other VD task. In the same way, we created a familiar-other voice space and compared familiar-other distances with familiar-other task performance. Significant correlations were re-run for, and compared between, the two forms of sound conduction (air, bone; electronic supplementary material). Acoustic parameters of all participants' voices are reported in the electronic supplementary material.
Figure 2. […] (d) for the blocks with immediate previous exposure to the target voice prior to the task. Bone conduction improved self-unfamiliar discrimination only when participants had not been exposed to their voice before the task (a). No such effects were observed for familiar-unfamiliar discrimination. *** p < 0.001.
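The voice-space construction described in the acoustic analysis methods can be sketched as follows. Two points are our assumptions: z-scores are computed within each gender group, since male and female voices use different y parameters, and the sample standard deviation is used. Note that the base of the logarithm is irrelevant here, because changing it rescales each axis and z-scoring removes the scale.

```python
import math
import statistics


def raw_coordinates(f0: float, formants: list[float], sex: str) -> tuple[float, float]:
    """Source (x) and filter (y) coordinates from F0 and [F1, ..., F5] in Hz."""
    x = math.log(f0)
    if sex == "male":
        y = math.log(formants[4] - formants[3])  # log(F5 - F4)
    elif sex == "female":
        y = math.log(formants[0])  # log(F1)
    else:
        raise ValueError("sex must be 'male' or 'female'")
    return x, y


def zscore(values: list[float]) -> list[float]:
    mu, sd = statistics.mean(values), statistics.stdev(values)
    return [(v - mu) / sd for v in values]


def pair_distances(coords: dict, pairs: list[tuple[str, str]]) -> list[float]:
    """Euclidean self-other distances in z-score units.

    coords maps voice IDs of ONE gender group (assumption) to raw (x, y)
    coordinates; pairs lists (self_id, other_id). Subtracting the other voice's
    z-scored coordinates places it at the origin, so the distance to the origin
    is the self-other distance.
    """
    ids = list(coords)
    zx = dict(zip(ids, zscore([coords[i][0] for i in ids])))
    zy = dict(zip(ids, zscore([coords[i][1] for i in ids])))
    return [math.hypot(zx[s] - zx[o], zy[s] - zy[o]) for s, o in pairs]
```

The resulting distances can then be correlated with per-pair task accuracy, for example with `statistics.correlation` (Python 3.10 or later).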
[…] psychometric curve fitted for the blocks without previous exposure, and of Voice Morph (estimate = 0.55, Z = 22.67, p < 0.001), indicating that the proportion of 'self' responses increased with the amount of self-voice present in the voice morphs (figure 2). Moreover, the analysis yielded a significant interaction between Conduction and Previous Exposure (estimate = 0.43, Z = 2.85, p = 0.004). To investigate the nature of the interaction, we ran a separate binomial mixed-effects regression for each level of Previous Exposure. The analysis for the blocks without previous exposure to self-voice revealed a […]
Collectively, the results of Study 1 indicate that (i) participants are better at discriminating self- and other voices when voice morphs are presented through bone conduction, (ii) previous exposure makes the self-other VD task easier, and (iii) this bone conduction-related enhancement of self-perception disappears when participants are exposed to their own unmorphed voice before the task.

Study 2: familiar-other voice discrimination
The enhanced self-voice perception observed in Study 1 may have been driven by familiarity with one's own voice rather than by one's own voice per se. Hence, in Study 2, the experimental design (figure 1c) and statistical analysis were equivalent to those of Study 1, except that the self-voice was replaced with the voice of a familiar other.
[…] These data show that familiar-other discrimination was not significantly affected by sound conduction type or by previous exposure to the familiar voice (figure 2c,d). Thus, they suggest that the effects observed in Study 1 involve self-related processes rather than familiarity.

Study 3: self-other, familiar-other, and self-familiar voice discrimination
Studies 1 and 2 showed that the bone conduction effects are specific to self-voice and do not generalize to other familiar voices, albeit in different groups of participants. Hence, in the first part of Study 3, we contrasted self-other and familiar-other VD tasks in the same, independent cohort of participants (figure 1d). In addition, in the second part of Study 3, we performed a self-familiar VD task in the same group of participants to investigate whether the observed bone conduction effects depend on other-voice familiarity (figure 1d).
The results of the three discrimination tasks (self-other, familiar-other, and self-familiar) are illustrated in figure 3. […] Overall, these data demonstrate that bone conduction improved performance in VD tasks whenever the task involved self-voice morphs, regardless of other-voice familiarity (steeper psychometric curves in the self-other and self-familiar tasks, but not in the familiar-other task; asterisks in the middle of the plots in figure 3). Lower intercepts for bone conduction (asterisks at the left end of the plots in figure 3) indicate that this was especially prominent for other-dominant voice morphs (i.e. those containing a lower proportion of self-voice) [59].
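What "steeper curve" and "lower intercept" mean for behaviour can be illustrated with a logistic psychometric function. In the sketch below the intercept is the log-odds of a 'self' response at 0% self-voice, and both parameter sets are hypothetical values chosen for illustration, not estimates from the study. The two curves share a point of subjective equality at 50%, but the steeper, lower-intercept "bone" curve yields fewer 'self' responses to other-dominant morphs, more to self-dominant morphs, and thus higher overall accuracy.

```python
import math

MORPH_LEVELS = (15, 30, 45, 55, 70, 85)  # % self-voice


def p_self(morph_pct: float, intercept: float, slope: float) -> float:
    """Logistic psychometric function: P('self' response) at a given morph level."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * morph_pct)))


def pse(intercept: float, slope: float) -> float:
    """Point of subjective equality: morph level at which P('self') = 0.5."""
    return -intercept / slope


def expected_accuracy(intercept: float, slope: float) -> float:
    """Mean proportion correct, counting 'self' as correct only for self-dominant morphs."""
    correct = [p_self(m, intercept, slope) if m > 50 else 1 - p_self(m, intercept, slope)
               for m in MORPH_LEVELS]
    return sum(correct) / len(correct)


# Hypothetical parameters (NOT fitted to the study's data): same PSE of 50%,
# but a steeper slope and a lower intercept for bone conduction.
AIR = (-3.0, 0.06)
BONE = (-4.5, 0.09)
```

With these values, the probability of answering 'self' to a 15% morph drops from about 0.11 (air) to about 0.04 (bone), which is the left-end pattern the asterisks in figure 3 refer to.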

Self-other voice discrimination acoustic analysis
We subsequently investigated whether the physical acoustic parameters that have been shown to account for the discrimination of other voices [18] also affect VD for one's own voice. Participants' unmorphed voices were placed in a self-other voice space in which similar-sounding voices are located close to each other and the distance to the origin represents the acoustic difference between self- and other voices.
Correlation analysis indicated a positive association between self-other voice distances and self-other task performance, for both the self-familiar and the self-unfamiliar tasks (r = 0.2, 95% CI = [0.01, 0.38]; t(98) = 2.06, p = 0.042; figure 4b), indicating that the same acoustic parameters that have been linked to the discrimination of other voices [18] account for VD of the self-voice. Neither sound conduction (air, bone) nor the type of other voice (familiar, unfamiliar) affected the relationship between task performance and self-other distance (electronic supplementary material). Further analyses of acoustic properties (gender differences, alternative voice-space constructions, and the separate contributions of source (larynx) and filter (vocal tract)) are reported in the electronic supplementary material.
Because previous work constructed the voice space using three vowels (/a/, /i/, /u/) that reflect the extremes of the vowel space area [75], whereas we used only /a/, we repeated the analysis with an alternative construction of the second voice-space dimension that is better tailored to the vowel /a/. The results did not change significantly and are reported in the electronic supplementary material.
Acoustic parameters of all participants' voices from all studies are also reported in the electronic supplementary material.
[…] These data show that, although participants were mostly correct in identifying their own unmorphed voice (79% hit rate), they also misidentified both familiar and unfamiliar other voices as their own in some trials (17% and 13% FA rates, respectively), indicating that recognizing one's own voice in a recording, even without additional transformations (e.g. morphing with other voices), is not as trivial as it might seem. Accordingly, the nine of 52 participants who failed to recognize their voice in more than half of the self-voice trials (i.e. accuracy below 50%) in the control task were considered outliers and excluded from the discrimination-task analyses above.
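The control-task measures and the exclusion criterion can be sketched as follows. The trial format, a tuple of (participant, voice, response), is hypothetical, and the code assumes every participant heard all three voices at least once; with two blocks of 10 repetitions per voice, as in the design, that holds.

```python
from collections import defaultdict

VOICES = ("self", "familiar", "unfamiliar")


def control_rates(trials):
    """Per-participant hit, miss and FA rates from (participant, voice, response) tuples.

    response is 'self' or 'other'. Hit = 'self' to self-voice; miss = 1 - hit;
    FA = 'self' to the familiar or unfamiliar voice.
    """
    counts = defaultdict(lambda: {v: [0, 0] for v in VOICES})  # [n 'self' responses, n trials]
    for participant, voice, response in trials:
        c = counts[participant][voice]
        c[0] += int(response == "self")
        c[1] += 1
    rates = {}
    for participant, c in counts.items():
        hit = c["self"][0] / c["self"][1]
        rates[participant] = {
            "hit": hit,
            "miss": 1 - hit,
            "fa_familiar": c["familiar"][0] / c["familiar"][1],
            "fa_unfamiliar": c["unfamiliar"][0] / c["unfamiliar"][1],
        }
    return rates


def included(rates, min_hit=0.5):
    """Participants kept for the discrimination analyses (self-voice accuracy >= 50%)."""
    return sorted(p for p, r in rates.items() if r["hit"] >= min_hit)
```

The per-participant miss and FA rates produced here are the inputs to the correlations and to the Pearson and Filon comparison reported in the next subsection.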

Other-to-self-voice confusion
The self-voice recognition task indicated that, in a non-negligible number of trials, participants misperceived either the familiar (17%) or the unfamiliar (13%) voice as their own (i.e. responded 'self' to 'other' stimuli: false alarms, FA). However, it remained unknown whether the two FA rates (unfamiliar FA and familiar FA) are related to each other, which would indicate a general sense of ownership over other voices, and whether they are related to reduced recognition of one's own voice, i.e. to the miss rate (responding 'other' to 'self' stimuli), which would indicate that other voices were confused with the self-voice (self-voice disownership). We therefore ran correlation analyses between the two FA rates and the miss rate. A significant correlation between the two FA rates would indicate that participants who misperceived one type of voice as self-voice also misperceived the other, suggesting a general tendency to misperceive other voices as self-voice regardless of voice familiarity. A significant correlation between the miss rate and a given FA rate would indicate that the corresponding other voice is confused with the self-voice.
Pearson's product-moment correlation did not show a significant relationship between the two FA rates (r = −0.07, 95% CI = [−0.34, 0.21]; t(50) = −0.49, p = 0.624), showing that misperceiving the familiar voice as one's own was unrelated to misperceiving the unfamiliar voice as one's own. The correlation analysis identified a significant positive relationship between miss rate and familiar-FA rate (r = 0.67, 95% CI = [0.48, 0.79]; t(50) = 6.31, p < 0.001), whereas there was no significant relationship between miss rate and unfamiliar-FA rate (r = 0.2, 95% CI = [−0.07, 0.45]; t(50) = 1.46, p = 0.151) (figure 5b). Pearson and Filon's z-test identified a stronger relationship of the miss rate with the familiar-FA rate than with the unfamiliar-FA rate (z = 2.86, p = 0.004), indicating that participants confused self-voice more with the familiar than with the unfamiliar voice. Thus, although the familiar-to-self FA rate was not significantly higher than the unfamiliar-to-self FA rate (figure 5a), only the familiar-to-self FA rate was related to the self-voice miss rate (figure 5b), indicating that only the familiar other voice is confused with self-voice. This can also be illustrated with a confusion matrix (figure 5c): falsely identified self-voice trials mostly shifted towards the familiar, not the unfamiliar, voice (red arrow in figure 5c). This is because participants who answered 'self' for the familiar voice did not answer 'self' for the actual self-voice, whereas participants who answered 'self' for the unfamiliar voice also answered 'self' for the actual self-voice. The first group of participants thus confused the familiar voice with self-voice, whereas the second group probably had a bias towards answering 'self'. None of the correlations were affected by sound conduction type (electronic supplementary material).
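The t statistics and confidence intervals above follow the standard formulas for a Pearson correlation: t = r√(n − 2)/√(1 − r²) with n − 2 degrees of freedom, and a CI obtained via Fisher's z-transformation with standard error 1/√(n − 3). These are the formulas R's `cor.test` uses; that the authors used `cor.test` is our assumption. A minimal sketch:

```python
import math
from statistics import NormalDist


def correlation_t(r: float, n: int) -> tuple[float, int]:
    """t statistic and degrees of freedom (n - 2) for testing H0: rho = 0."""
    df = n - 2
    return r * math.sqrt(df) / math.sqrt(1 - r * r), df


def fisher_ci(r: float, n: int, level: float = 0.95) -> tuple[float, float]:
    """Confidence interval for rho via Fisher's z = atanh(r), SE = 1 / sqrt(n - 3)."""
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)
    z, se = math.atanh(r), 1 / math.sqrt(n - 3)
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)
```

Plugging in the rounded values reported above (e.g. r = 0.67 with n = 52 participants, or r = 0.2 with 100 voice pairs in the acoustic analysis) gives t values and intervals close to, but not identical with, the published ones, presumably because the published r values are rounded. The Fisher interval is asymmetric around r, which is why, for example, [0.48, 0.79] is not centred on 0.67.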
Collectively, these results suggest that we are prone to confusing a familiar voice with our own. This seems to occur because the voice is familiar, regardless of the acoustic similarity between the two voices. It sheds new light on the effects of familiarity in stimuli associated with the self.
Figure 5. Study 3: control task. The bar plot (a) shows mean rates of 'self' responses for each type of voice stimulus: the hit rate for the self-voice and the false-alarm (FA) rates for the familiar and unfamiliar voices. The regression plots (b) show the relationships of the FA rates for the familiar and unfamiliar voices with the miss rate for the self-voice. Bar plot whiskers and shaded areas around the linear regressions indicate 95% confidence intervals. The absolute rates at which the familiar and unfamiliar voices were misperceived as the self-voice did not differ (a). However, only the familiar-voice misperception was related to the self-voice miss rate (b). Panel (c) shows a confusion matrix that maps the three vocal stimuli (self, familiar, unfamiliar; rows) to participants' responses ('self', 'other'; columns). In this matrix, the relationship can be represented as a shift of falsely labelled self-voice trials mostly towards the familiar voice (red arrow). ***p < 0.001.
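The confusion matrix in figure 5c and the per-participant rates used in the correlations could be tabulated from trial-level responses along the following lines. This is a minimal sketch. The (stimulus, response) trial format and the function names are hypothetical assumptions for illustration and do not reflect the format of the shared dataset.

```python
from collections import Counter

STIMULI = ("self", "familiar", "unfamiliar")  # rows of the confusion matrix
RESPONSES = ("self", "other")                 # columns of the confusion matrix


def confusion_matrix(trials):
    """Row-normalized confusion matrix from (stimulus, response) pairs.

    Each row gives, for one presented voice, the proportion of trials
    answered 'self' and 'other'. Rows without trials are filled with NaN.
    """
    counts = Counter(trials)
    matrix = {}
    for stimulus in STIMULI:
        total = sum(counts[(stimulus, r)] for r in RESPONSES)
        matrix[stimulus] = {
            r: counts[(stimulus, r)] / total if total else float("nan")
            for r in RESPONSES
        }
    return matrix


def recognition_rates(trials):
    """Hit, miss and false-alarm (FA) rates for one participant."""
    m = confusion_matrix(trials)
    return {
        "hit": m["self"]["self"],                  # 'self' to the self-voice
        "miss": m["self"]["other"],                # 'other' to the self-voice
        "fa_familiar": m["familiar"]["self"],      # 'self' to the familiar voice
        "fa_unfamiliar": m["unfamiliar"]["self"],  # 'self' to the unfamiliar voice
    }


if __name__ == "__main__":
    # Hypothetical participant with 10 trials per voice
    trials = ([("self", "self")] * 8 + [("self", "other")] * 2
              + [("familiar", "self")] * 3 + [("familiar", "other")] * 7
              + [("unfamiliar", "self")] * 1 + [("unfamiliar", "other")] * 9)
    print(recognition_rates(trials))
```

Because each row is normalized, the miss rate is one minus the hit rate. Computing these rates per participant and correlating them across participants gives the kind of analysis reported above.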

Discussion
As our main means of verbal communication, our voice is an integral part of our identity and our self. Although traditionally thought of as a purely auditory signal, self-voice is a multi-modal percept that also involves vibrotactile input [32,33], at least when we are actively speaking. Perceiving our voice passively, that is, through air-conduction loudspeakers, differs in two ways. (i) It sounds different, since it is not low-pass filtered by passing through the skull. (ii) It lacks the multi-modal input that accompanies speech production. Consequently, our ability to recognize ourselves in air-conducted recordings is reduced [16,36,37,41–43]. This has significantly impeded self-voice-related research, rendering it one of the least investigated aspects of self-consciousness [1]. Previous work has tried to approximate the natural self-voice by applying acoustic transformations [50–54]. Here, we instead focused on the multi-modal aspect of the self-voice by presenting stimuli through a commercially available bone conduction headset. This allowed us to pinpoint perceptual specificities related to the self-voice, ranging from low-level acoustic aspects to high-level cognitive aspects such as familiarity and previous exposure.

Vibrotactile stimulation
Studies 1 and 3 demonstrated that self-other VD improves with bone compared with air conduction. As we argue below, this points to the importance of the vibrotactile signals generated by bone conduction. These signals matter over and above both the low-frequency filtering of the auditory signal and voice familiarity.
Acoustic transformations resulting from bone conduction play an important role in self-voice perception. They might constitute an internal model of what the self-voice should sound like. Hearing our voice through bone as opposed to air conduction might thus better approximate this model, resulting in higher performance in the self-other VD task. However, acoustic transformations are difficult to manipulate experimentally. The exact transfer function of the skull and other head tissues has not yet been formally defined and remains a topic of ongoing research in acoustics [31,76]. Studies that tried to experimentally alter the sound of the air-conducted self-voice to render it more similar to the bone-conducted one yielded inconsistent results [50–54]. We therefore opted for presenting self-voice stimuli directly through bone conduction. This produces both acoustic transformations of the sound of our voice and accompanying vibrotactile stimulation, which has been neglected in previous studies.
Importantly, the bone conduction effect was specific to the self-voice and did not extend to other familiar voices: bone conduction did not improve familiar-other VD (Studies 2 and 3). This separates the effect of bone conduction from the effect of voice familiarity. If bone conduction effects were related to neural mechanisms associated with familiarity, similar differences should have occurred in the familiar-other task. As this was not the case, the bone conduction effects are likely specific to the self-voice. This is not because the self-voice is a stimulus we are familiar with, but because it involves a dedicated neural system associated with the self.

Moreover, the lack of perceptual differences between air and bone conduction in the familiar-other task further supports the importance of the vibrotactile stimulation accompanying bone conduction. It speaks against an account based solely on physical transformations of the self-voice stimuli (e.g. a deeper sound due to the filter properties of bone tissue). If physical transformations alone accounted for the bone conduction effect in the self-other task, the opposite effect should have been observed in the familiar-other task, i.e. a disadvantage of bone conduction for familiar voices. This is because we are not used to hearing familiar voices transformed in such a way (e.g. deeper voices of acquaintances). Equivalent acoustic transformations applied by the same bone conduction headset led to perceptual differences for self-voice but not for familiar-voice stimuli. We therefore suggest that concomitant vibrotactile signals, usually exclusive to one's own voice, play an important role in self-voice processing. This suggests that the self-voice is a multi-modal construct. It also consolidates evidence in favour of self-related processes in voice perception, as has been shown for face perception [14,77].

Previous exposure
Our results further demonstrate better performance in self-voice tasks in people with more previous exposure to their own voice via recordings, as previously demonstrated, e.g. in radio announcers [34]. To target the effect of previous exposure to the self-voice in a more controlled way, we played the unmorphed self-voice to participants before the self-other VD task in half of the experimental blocks of Study 1. This was done in the second half of the experiment so as not to affect performance in the first half. Confirming previous findings [34,35], we found that prior exposure facilitated self-other VD. Participants' performance improved when they heard their unmorphed voice recording immediately before the task. This was possibly because it allowed them to adopt an arbitrary strategy of recognizing that specific recording within a voice morph, regardless of whether or not they associated the recording with themselves. For instance, when hearing two unmorphed voices before the task, participants could focus on one acoustic property (e.g. a higher pitch in one of the recordings). They could then use it as a reference against which the voice morphs are compared. Without previous exposure, however, there was no such additional reference, and participants had to rely on their internal self-voice representation.

Previous exposure had no effect in the familiar-other task of Study 2. This suggests that the effect in Study 1 was not an artefact of manipulating previous exposure only in the second half of the experiment. In this sense, Study 2 controls for task habituation effects.
The addition or omission of pre-exposure to one's own voice is also important for understanding the contribution of bone conduction in Study 1. Bone conduction only improved self-other VD when this discrimination was based on an internal representation of one's own voice. It did not do so when participants could compare the morphs with the pre-exposure stimulus, which essentially rendered the task easier. This is supported by the familiar-other VD findings (Study 2): bone conduction did not affect performance in blocks with familiar versus unfamiliar voices, with or without previous exposure. This suggests that bone conduction facilitation is found only for the self-voice. It is absent for familiar voices, whose representations are mainly based on auditory cues.

Familiarity and acoustic parameters
We performed additional analyses demonstrating that both familiarity processing and acoustic differences contribute to self-other VD. On the one hand, the correlation analysis between miss and FA rates in the control self-voice recognition task shows that self-voice perception inevitably involves some familiarity processing. A failure to recognize one's own voice (miss rate) was correlated with familiar-to-self, and not with unfamiliar-to-self, voice misattribution (FAs), regardless of the acoustic similarity between the three voices. This suggests that the self-voice was more confused with a familiar than with an unfamiliar other voice. It also suggests that familiarity mechanisms [56,57] bias self-voice perception. An analogy in the visual domain would be observing that participants confuse their own face more with a familiar than with an unfamiliar other face. This would hold even if the unfamiliar face were physically more similar (e.g. in eye colour or nose shape).

On the other hand, our voice-space analysis indicates that low-level acoustic properties also have some impact on self-voice recognition (for detailed results and discussion, see electronic supplementary material). Without a priori hypotheses, we placed our participants' voices in other-centred voice spaces [18] and observed a correlation between acoustic distances and discriminability ratings. This supports the involvement of a third factor, low-level acoustic processes, in self-other VD. It also indicates that the acoustic differences accounting for the discrimination of other voices extend to the self-voice, which should be explored further (see electronic supplementary material). In sum, these findings show that both familiarity mechanisms and acoustic processes contribute to self-voice perception. Future studies should identify ways to delineate the respective contributions of these factors.

Task sensitivity
While previous self-voice tasks have been characterized by ceiling effects, the paradigm proposed here captures inter-subject variability. This allows us to dissect perceptual specificities (e.g. a bias, general sensitivity, or effects specific to self- or other-voice perception) and to personalize studies of self-other VD. Most participants in Studies 1 and 3 (as well as in our follow-up EEG study [59]) spontaneously reported that they perceived the self-other VD task as very difficult. They also showed poor metacognition; that is, they misjudged their ability to perform the task successfully. Moreover, in Study 1, we observed large differences in performance across participants. Therefore, in Study 3, we increased the sample size based on the power analysis of Study 1 (see Methods). We also introduced a control task at the end of the experiment to restrict the self-other VD analysis to participants able to recognize their own unmorphed voice. To our surprise, nine out of 52 participants (17.3%) could not recognize their unmorphed voice in more than half of the self-recognition trials and were thus excluded from the analysis of the VD tasks. This shows that recognizing one's own voice from short vocalizations (e.g. the phoneme /a/ lasting 500 ms) is not as trivial as it might seem, even without voice morphing. This is notable because such vocalizations have been shown to suffice for speaker identification [78]. Moreover, the control self-voice recognition task was easier than the self-other VD task, which might account for the lack of differences between air and bone conduction in the control task. Namely, bone conduction advantages for self-voice perception may be detectable only in sufficiently sensitive tasks. This was probably not the case for the control task, as it did not involve voice morphing.
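The exclusion rule applied after the control task can be expressed as a simple filter on each participant's miss rate for the unmorphed self-voice. Two aspects of this sketch are assumptions: retaining participants whose miss rate is exactly 0.5 reflects our reading of "more than half", and the participant identifiers are hypothetical. With 9 of 52 participants excluded, the 17.3% figure and the remaining N = 43 follow directly.

```python
def apply_self_recognition_criterion(miss_rates, max_miss=0.5):
    """Split participants by the control-task criterion.

    miss_rates: dict mapping participant id -> proportion of unmorphed
    self-voice trials answered 'other'. Participants who failed to recognize
    their own voice in more than `max_miss` of trials are excluded.
    Returns (kept_ids, excluded_ids), both sorted.
    """
    kept = sorted(pid for pid, m in miss_rates.items() if m <= max_miss)
    excluded = sorted(pid for pid, m in miss_rates.items() if m > max_miss)
    return kept, excluded


def percent(part, whole):
    """Percentage of `whole` represented by `part`."""
    return 100 * part / whole


if __name__ == "__main__":
    # Hypothetical cohort: 9 of 52 participants miss their own voice on 60% of trials
    miss = {f"P{i:02d}": (0.6 if i < 9 else 0.1) for i in range(52)}
    kept, excluded = apply_self_recognition_criterion(miss)
    print(len(kept), len(excluded), round(percent(len(excluded), len(miss)), 1))
    # prints: 43 9 17.3
```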
The bone conduction advantage also disappeared in the blocks with previous exposure in Study 1. As indicated above, previous exposure enabled participants to use other strategies to perform the self-other VD task, which probably made the task easier. This could have led to similar ceiling effects, under which bone conduction advantages are not detectable. Importantly, no such change in bone conduction effects was observed in Study 2. That study manipulated previous exposure in the same way as Study 1, but for familiar-unfamiliar VD. This suggests that task difficulty matters for the bone conduction effects, but only when the task involves the self-voice.

Other-dominant voice morphs
Bone conduction improved performance in self-other VD in both Studies 1 and 3 (a steeper slope of the psychometric curve). However, the morphs for which the difference in performance was largest differed between the studies (self-dominant morphs in Study 1, figure 2a; other-dominant morphs in Study 3, figure 3a). We attribute this difference mainly to two factors: (i) the smaller sample size in Study 1 (N = 16, versus N = 43 in Study 3) and (ii) the poorer sound quality of the air-conducting medium in Study 1 (laptop loudspeakers in Study 1 as compared with headphones in Study 3). We believe that the bone conduction effect is indeed specific to other-dominant self-other voice morphs. It occurred in Study 3 for two different tasks (self-other and self-familiar). Importantly, it was also replicated in a follow-up EEG study in which an independent cohort of participants performed the same self-other task with five times more trials [59]. This suggests that bone conduction does not so much help to label an ambiguous voice as 'self' as it facilitates rejecting an ambiguous voice as 'not self'. In other words, our data show that bone conduction specifically facilitates making a 'not self' judgement in situations of vocal ambiguity.
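The slope comparison can be illustrated by fitting a logistic psychometric function to the proportion of 'self' responses across morph levels, logit P('self') = b0 + b1·m. Here m is the proportion of self-voice in the morph, b1 is the slope (steeper means better discrimination) and −b0/b1 is the point of subjective equality. The sketch below fits this function by maximum likelihood with Newton-Raphson. Several choices are assumptions: the logistic form, the absence of guess and lapse parameters, and the morph coding. The fitting procedure actually used in the studies is described in the Methods.

```python
import math


def fit_logistic(levels, n_self, n_total, max_iter=100, tol=1e-10):
    """Maximum-likelihood fit of logit P('self') = b0 + b1 * level.

    levels: morph levels (e.g. proportion of self-voice, 0 to 1)
    n_self: number of 'self' responses at each level
    n_total: number of trials at each level
    Uses Newton-Raphson on the binomial log-likelihood and assumes the
    responses are not perfectly separated (otherwise the MLE diverges).
    """
    b0 = b1 = 0.0
    for _ in range(max_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, k, n in zip(levels, n_self, n_total):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            resid = k - n * p        # contribution to the score vector
            w = n * p * (1.0 - p)    # contribution to the Fisher information
            g0 += resid
            g1 += resid * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: (b0, b1) += I^-1 * score, with I the 2x2 information matrix
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) < tol and abs(d1) < tol:
            break
    return b0, b1


def point_of_subjective_equality(b0, b1):
    """Morph level at which 'self' and 'other' responses are equally likely."""
    return -b0 / b1


if __name__ == "__main__":
    levels = [i / 10 for i in range(11)]

    def expected(b0, b1):
        # Hypothetical expected 'self' counts for 100 trials per level
        return [100 / (1 + math.exp(-(b0 + b1 * x))) for x in levels]

    air = fit_logistic(levels, expected(-3.0, 6.0), [100] * 11)
    bone = fit_logistic(levels, expected(-5.0, 10.0), [100] * 11)
    print("slopes (air, bone):", round(air[1], 3), round(bone[1], 3))
```

Comparing b1 between bone and air conduction within participants is one way to express the slope difference described above. It is not necessarily the statistical model used in the studies.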

Motor signals
It is important to note that natural self-voice perception also involves motor signals related to speech production, which were not tested here. Such motor signals and the associated intraoral and pharyngeal speech-related sensory cues may exert additional effects on self-other VD, as opposed to familiar-other VD. A study in which participants hear self-other voice morphs triggered by their vocalization onset could investigate the potential role of motor signals, in addition to or as opposed to vibrotactile effects. This would, however, be challenging, as speech-related motor signals and the resulting vibrotactile sensations cannot be experimentally separated during natural speech.

Headset frequency response
The observed bone conduction effects might partially be accounted for by differences in frequency response between the bone- and air-conduction headsets. Namely, our bone conduction headset might have a low-frequency emphasis that renders the self-voice more familiar to listeners, thereby increasing self-other VD performance. To verify that account, we would have to measure and compare the frequency responses of our headsets. However, Manning et al. [79] measured the frequency response of the same bone conduction headset used in our studies and found it to be quite flat, even at higher frequencies. By contrast, the response of the air-conduction headset in that study had a marked low-frequency emphasis, which one might rather have expected of a bone conduction headset. Importantly, the sound reaching the inner ears from the two headsets travels by different routes: via the outer ears for the air-conduction headset and via the cheekbones for the bone conduction headset. Even if both headsets had the same frequency response, the sound arriving at the inner ears would therefore always differ in frequency response, because the bone-conducted sound is filtered by the skull and other tissues of the head.
The exact transfer function of the head is still unknown [31,76], and previous attempts at filtering the air-conducted sound did not yield conclusive results [50–54]. It therefore remains difficult to isolate the contribution of acoustic filtering to the present effects, especially relative to the contribution of concomitant vibrotactile stimulation.
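In principle, the frequency-response check discussed above could be carried out by presenting pure tones through each headset, recording the output and comparing gains across frequencies. The sketch below estimates the gain at a tone frequency with a single-bin discrete Fourier transform. It then summarizes any low-frequency emphasis as the mean gain in a low band minus the mean gain in a high band. The recording step is replaced by a simulated first-order low-pass "device". The sampling rate, tone frequencies, band split and filter are all hypothetical. Real bone-conducted output would require a transducer-specific measurement setup.

```python
import math

FS = 16_000     # sampling rate in Hz (assumed)
DURATION = 0.5  # s; every test frequency below completes a whole number of cycles


def tone(freq, fs=FS, duration=DURATION):
    """Unit-amplitude sine tone."""
    n = int(round(duration * fs))
    return [math.sin(2 * math.pi * freq * i / fs) for i in range(n)]


def amplitude_at(signal, freq, fs=FS):
    """Amplitude of the `freq` component via a single-bin DFT projection."""
    c = sum(x * math.cos(2 * math.pi * freq * i / fs) for i, x in enumerate(signal))
    s = sum(x * math.sin(2 * math.pi * freq * i / fs) for i, x in enumerate(signal))
    return 2 * math.hypot(c, s) / len(signal)


def one_pole_lowpass(signal, cutoff=1000.0, fs=FS):
    """Hypothetical device with a low-frequency emphasis (first-order IIR low-pass)."""
    a = math.exp(-2 * math.pi * cutoff / fs)
    y, out = 0.0, []
    for x in signal:
        y = (1 - a) * x + a * y
        out.append(y)
    return out


def gain_db(device, freq):
    """Output/input amplitude ratio of `device` at `freq`, in dB."""
    x = tone(freq)
    return 20 * math.log10(amplitude_at(device(x), freq) / amplitude_at(x, freq))


def low_frequency_emphasis_db(device, low=(100, 200, 400), high=(2000, 4000)):
    """Mean low-band gain minus mean high-band gain (dB); about 0 for a flat response."""
    low_mean = sum(gain_db(device, f) for f in low) / len(low)
    high_mean = sum(gain_db(device, f) for f in high) / len(high)
    return low_mean - high_mean


def flat_device(signal):
    """Ideal device with a perfectly flat response."""
    return list(signal)


if __name__ == "__main__":
    print("flat:", round(low_frequency_emphasis_db(flat_device), 2))
    print("low-pass:", round(low_frequency_emphasis_db(one_pole_lowpass), 2))
```

Using tones that complete a whole number of cycles keeps the single-bin estimate free of spectral leakage. This keeps the sketch short without requiring an FFT library.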

Impact and clinical relevance
The impact of this work is threefold. First, by shifting the classical perspective on the self-voice from a purely auditory to a multi-modal one, these findings incorporate the self-voice into multi-sensory accounts of self-consciousness [80–83]. According to these accounts, the sense of self is fundamentally based on the continuous integration of multi-sensory bodily signals, including tactile, proprioceptive, interoceptive, visual and auditory signals [10,81,84–86]. Correspondingly, we show that the integration of auditory and vibrotactile signals improves the recognition of our voice, which is an integral part of our self. Second, by introducing a method that improves auditory self-identification, we propose a new approach to addressing self-voice-related research questions. Based on these findings, future studies can avoid presenting self-voice stimuli through traditional air-conducting media, especially considering the increasing availability of bone conduction headsets. Finally, this work could serve as a scaffold for clinical investigations of a very common [23,24] and highly distressing [25,26] psychiatric symptom: AVH ('hearing voices'), which has been proposed to arise from a self-other VD deficit [19,22,87–89]. Specifically, researchers could characterize differences in self-other VD curves between voice-hearers and controls, e.g. by quantifying a bias to hear another's voice and relating it to clinical measures. This could both deepen the understanding of, and put to the test, this prominent account of AVH aetiology. Collectively, our findings demonstrate the importance of bone conduction for self-voice perception. They shed new light on the phenomenology of the self by portraying the self-voice as a fundamentally multi-modal construct in which both familiarity and acoustic properties play a significant role.
Data accessibility. The datasets generated and/or analysed during the current study are available in the Open Science Framework (OSF) repository: https://osf.io/uxvh7/.
The data are provided in the electronic supplementary material [90].